
Focused crawling: a new approach to topic-specific Web resource discovery


Abstract

The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
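To make the crawl-boundary idea concrete, the following is a minimal best-first crawler sketch, not the authors' implementation: the paper's trained classifier and distiller are stood in for by a hypothetical keyword-overlap relevance score, and the threshold, page limit, and timeout values are assumptions chosen only for illustration. Links are expanded in order of the relevance of the page they were found on, and out-links of pages below the relevance threshold are pruned, which is the mechanism that keeps the crawl inside the topical neighbourhood.

import heapq
import re
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href targets and visible text from a fetched page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def relevance(text, topic_terms):
    # Hypothetical stand-in for the paper's classifier:
    # fraction of topic terms that occur on the page.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(t in words for t in topic_terms) / len(topic_terms)


def focused_crawl(seeds, topic_terms, max_pages=100, threshold=0.3):
    # Frontier is a max-priority queue keyed by the parent page's relevance,
    # so links discovered on highly relevant pages are expanded first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []

    while frontier and len(harvested) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue

        parser = LinkExtractor()
        parser.feed(html)
        score = relevance(" ".join(parser.text), topic_terms)
        if score < threshold:
            continue  # prune: do not keep or expand irrelevant pages

        harvested.append((url, score))
        for href in parser.links:
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))

    return harvested

A call such as focused_crawl(["http://example.com/"], {"crawler", "hypertext", "topic"}) would return a list of (url, score) pairs; the real system replaces the keyword score with a trained topic classifier and adds the distiller to promote hub pages in the frontier.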
